Conversation
Commits:
- fix release metadata readlocc
- fix update last sync ts
- update debug log
- debug log and debug print ccpage function
- debug log for range split datasyncscan and debug print ccmap function
- disable debug log
- fix slice size mismatch
Walkthrough

This pull request extends transaction service components to support running operations on candidate leader nodes, in addition to current leaders. It refactors term resolution across multiple CC request types and handlers to use request-provided terms with candidate/standby fallbacks, and updates method signatures to propagate an `allow_run_on_candidate` flag.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    actor Client
    participant TransactionExecution
    participant CcHandler as CcHandler<br/>(LocalCcHandler)
    participant Sharder
    participant CcRequest as CcRequest<br/>(e.g., ReadCc)
    participant CcShard
    Client->>TransactionExecution: InitTx(allow_run_on_candidate=true)
    TransactionExecution->>TransactionExecution: Store allow_run_on_candidate_
    TransactionExecution->>CcHandler: NewTxn(..., allow_run_on_candidate=true)
    CcHandler->>CcRequest: Reset(..., allow_run_on_candidate=true)
    CcRequest->>CcRequest: Store allow_run_on_candidate flag
    Client->>TransactionExecution: Read(...)
    TransactionExecution->>CcHandler: Read(..., allow_run_on_candidate_=true)
    CcHandler->>CcRequest: Reset(..., allow_run_on_candidate=true)
    CcRequest->>Sharder: CandidateLeaderTerm(ng_id)
    Sharder-->>CcRequest: candidate_term (if available)
    alt candidate_term available and allowed
        CcRequest->>CcRequest: ng_term = candidate_term
    else fallback
        CcRequest->>Sharder: LeaderTerm(ng_id)
        Sharder-->>CcRequest: leader_term
        CcRequest->>Sharder: StandbyNodeTerm()
        Sharder-->>CcRequest: standby_term
        CcRequest->>CcRequest: ng_term = max(leader_term, standby_term)
    end
    CcRequest->>CcShard: Execute(ng_term)
    CcShard->>CcShard: Use ng_term for lock/data operations
    CcShard-->>CcRequest: Result
    CcRequest-->>CcHandler: Result
    CcHandler-->>TransactionExecution: Result
    TransactionExecution-->>Client: Data
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
tx_service/src/remote/remote_cc_request.cpp (1)

478-495: ⚠️ Potential issue | 🔴 Critical

Fix `ReadCc::Reset` argument order for outside reads.

After inserting `allow_run_on_candidate`, the outside-read call shifts arguments (`-1` → bool, `abort_if_oom` → partition_id). Pass `point_read_on_miss` explicitly to keep parameters aligned.

🛠️ Suggested fix

```diff
 ReadCc::Reset(&remote_table_name_,
               req.schema_version(),
               &req.key(),
               req.key_shard_code(),
               out_record,
               read_type,
               input_msg->tx_number(),
               input_msg->tx_term(),
               req.ts(),
               &cc_res_,
               ToLocalType::ConvertIsolation(req.iso_level()),
               ToLocalType::ConvertProtocol(req.protocol()),
               req.is_for_write(),
               req.is_covering_keys(),
               nullptr,
-              false,
-              -1,
-              req.abort_if_oom());
+              false,  // allow_run_on_candidate
+              false,  // point_read_on_miss
+              -1,     // partition_id
+              req.abort_if_oom());
```

tx_service/src/tx_execution.cpp (1)
2006-2014: ⚠️ Potential issue | 🟠 Major

Avoid stale bucket-lock state on cache hit.

Marking `lock_range_bucket_result_` finished here can cause later ops (e.g., object commands) to skip locking and reuse a stale `bucket_rec_` when `FastToGetBucket()` misses, which can misroute shard codes. Prefer clearing the lock result on cache hits instead of setting it finished.

Proposed fix

```diff
-    // Set the lock range bucket result
-    lock_range_bucket_result_.SetFinished();
+    // Cache hit doesn't populate bucket_rec_; avoid leaking
+    // prior lock state into later ops.
+    lock_range_bucket_result_.Value().Reset();
+    lock_range_bucket_result_.Reset();
```

tx_service/include/cc/template_cc_map.h (2)
504-512: ⚠️ Potential issue | 🔴 Critical

Add the trailing semicolon after `CODE_FAULT_INJECTOR`.

Line 512 now ends the macro invocation with `})`. If the macro expands to a `do { ... } while (0)` block (a common pattern), omitting the semicolon breaks compilation. Other call sites use `});`.

🛠️ Proposed fix

```diff
-    })
+    });
```
6053-6087: ⚠️ Potential issue | 🔴 Critical

Replace the `leader_term > 0` check with `CheckLeaderTerm()` to properly verify this node is the leader.

`GetNodeGroupLeaderTerm()` returns `std::max(LeaderTerm(ng_id), CandidateLeaderTerm(ng_id))`, not an indicator of whether the local node is leader. A candidate with a higher term would make `leader_term > 0` hold even though this node is not the leader, triggering the assert incorrectly. Use `Sharder::Instance().CheckLeaderTerm(cc_ng_id_, current_term)` instead to verify the node actually holds leadership before asserting buffered commands are invalid. Also verify that `current_term` (from `req.NodeGroupTerm()`) is guaranteed positive at this point.

tx_service/include/cc/cc_request.h (1)
402-551: ⚠️ Potential issue | 🟠 Major

Reset `allow_run_on_candidate_` in the string-key Reset to avoid stale state.

These requests are pooled; if a prior call set the flag to true, the string-key Reset currently leaves it unchanged and can unintentionally run remote acquires on candidates.

🔧 Suggested fix

```diff
     is_local_ = false;
     block_by_lock_ = false;
     abort_if_oom_ = abort_if_oom;
+    allow_run_on_candidate_ = false;
```

tx_service/src/cc/local_cc_shards.cpp (1)
3390-3497: ⚠️ Potential issue | 🟠 Major

Unpin should also run when term == 0.

With candidate-enabled flows, `TryPinNodeGroupData` can return 0 on success. The current `if (ng_term > 0)` skips unpinning in that case and can leak a pin. Switch to `>= 0` to match the pin-success contract.

Based on learnings: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded).

🛠️ Suggested fix

```diff
-    if (ng_term > 0)
+    if (ng_term >= 0)
     {
         Sharder::Instance().UnpinNodeGroupData(task->node_group_id_);
     }
```

tx_service/src/cc/local_cc_handler.cpp (4)
108-123: ⚠️ Potential issue | 🟠 Major

Propagate `allow_run_on_candidate` to remote `AcquireWrite`.

Local requests pass the flag into `AcquireCc::Reset`, but the remote path drops it, so candidate leaders on other nodes may still be rejected.

Proposed fix

```diff
 remote_hd_.AcquireWrite(cc_shards_.node_id_,
                         ng_id,
                         table_name,
                         schema_version,
                         key,
                         key_shard_code,
                         tx_number,
                         tx_term,
                         command_id,
                         ts,
                         is_insert,
                         hres,
                         hd_res_idx,
                         proto,
                         iso_level,
-                        abort_if_oom);
+                        abort_if_oom,
+                        allow_run_on_candidate);
```
310-319: ⚠️ Potential issue | 🟠 Major

Propagate `allow_run_on_candidate` to remote `PostWrite`.

The local path forwards the flag, but the remote call omits it, creating inconsistent candidate-leader support.

Proposed fix

```diff
 remote_hd_.PostWrite(cc_shards_.node_id_,
                      tx_number,
                      tx_term,
                      command_id,
                      commit_ts,
                      cce_addr,
                      record,
                      operation_type,
                      key_shard_code,
-                     hres);
+                     hres,
+                     allow_run_on_candidate);
```
413-423: ⚠️ Potential issue | 🟠 Major

Propagate `allow_run_on_candidate` to remote `PostRead`.

`PostReadCc::Reset` gets the flag locally, but the remote call omits it, so candidate-term validation may fail remotely.

Proposed fix

```diff
 remote_hd_.PostRead(cc_shards_.node_id_,
                     tx_number,
                     tx_term,
                     command_id,
                     key_ts,
                     gap_ts,
                     commit_ts,
                     cce_addr,
                     hres,
                     need_remote_resp,
-                    post_read_type);
+                    post_read_type,
+                    allow_run_on_candidate);
```
792-807: ⚠️ Potential issue | 🟠 Major

Pass `allow_run_on_candidate` into the string-key `ReadCc::Reset`.

The TxKey overload forwards the flag, but the string-key overload stops at the archive pointer; candidate runs may still be rejected in request execution.

Proposed fix

```diff
 read_req->Reset(&table_name,
                 0,
                 key_str,
                 shard_code,
                 &record,
                 read_type,
                 tx_number,
                 tx_term,
                 ts,
                 &hres,
                 iso_level,
                 proto,
                 is_for_write,
                 false,
-                nullptr);
+                nullptr,
+                allow_run_on_candidate);
```
🤖 Fix all issues with AI agents
In `@tx_service/include/cc/catalog_cc_map.h`:
- Around line 1143-1160: The code currently asserts ng_term > 0 after falling
back via Sharder::Instance() calls; replace this assertion with a runtime error
path: if ng_term <= 0, log a clear error referencing req.NodeGroupId() and the
failed lookups (NodeGroupTerm/StandbyNodeTerm/CandidateLeaderTerm), set the
function's error/Status return (or populate the response error fields used by
this module) and return early instead of proceeding to use ng_term; remove the
assert(ng_term > 0) and ensure no lock metadata is written when ng_term is
invalid.
- Around line 197-199: Replace the assert on ng_term with a runtime guard: check
req.NodeGroupTerm() into ng_term and if ng_term <= 0 return a proper error (or
take the same fallback path used by ReadCc) instead of crashing; update the
block around ng_term,
CODE_FAULT_INJECTOR("term_CatalogCcMap_Execute_PostWriteAllCc", ...) and
subsequent logic to handle the invalid term via the error/fallback path so
downstream code never receives a negative term.
In `@tx_service/include/cc/cluster_config_cc_map.h`:
- Around line 69-71: The assert on ng_term should be replaced with a graceful
validation like other cc_map implementations: after reading uint32_t ng_id =
req.NodeGroupId(); int64_t ng_term = req.NodeGroupTerm(); remove assert(ng_term
> 0) and instead check if (ng_term < 0) and return REQUESTED_NODE_NOT_LEADER (or
the function's error return consistent with catalog_cc_map.h / range_cc_map.h
patterns); ensure you reference ng_term and REQUESTED_NODE_NOT_LEADER so the
function exits cleanly when term is invalid rather than crashing.
In `@tx_service/include/cc/local_cc_handler.h`:
- Around line 110-121: The remote PostRead declaration and plumbing must be
updated to accept and forward the allow_run_on_candidate flag: modify the remote
handler's PostRead signature in remote_cc_handler.h to include the bool
allow_run_on_candidate parameter (matching local_cc_handler.h), update the
implementation that constructs the ValidateRequest RPC message to set the
corresponding field on ValidateRequest, and propagate the flag through any
intermediate call sites and RPC stubs (e.g., methods that build/send
ValidateRequest and the RPC handler) so remote PostRead honors
allow_run_on_candidate. Ensure symbols updated include PostRead,
allow_run_on_candidate, and ValidateRequest.
In `@tx_service/include/cc/local_cc_shards.h`:
- Around line 2450-2454: The compilation fails because
BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId use override but no matching virtual
exists in DataStoreHandler, and their parameter lists differ from
LocalCcShards::GetNextRangePartitionId; fix by adding a virtual declaration in
DataStoreHandler with the canonical signature used by LocalCcShards — bool
GetNextRangePartitionId(const TableName &tablename, const TableSchema
*table_schema, NodeGroupId ng_id, uint32_t range_cnt, int32_t
&out_next_partition_id) — or alternatively remove override from the handlers;
then update BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId to match that canonical parameter list
(add table_schema and ng_id, remove retry_count) so all declarations/definitions
are consistent.
In `@tx_service/include/cc/range_bucket_cc_map.h`:
- Around line 92-108: The code currently relies on assert(ng_term > 0) after
trying req.NodeGroupTerm(), CandidateLeaderTerm(), LeaderTerm() and
StandbyNodeTerm(); replace this assert with a runtime error path: after
computing ng_term check if ng_term <= 0 and if so log or construct a descriptive
error including req.NodeGroupId(), the original req.NodeGroupTerm(),
CandidateLeaderTerm(...), LeaderTerm(...), and StandbyNodeTerm(), then throw a
std::runtime_error (or return a failure status according to your error model)
instead of asserting so callers receive a deterministic error rather than
crashing or continuing with invalid metadata.
In `@tx_service/include/cc/range_cc_map.h`:
- Around line 220-236: The code currently asserts ng_term > 0 after trying
CandidateLeaderTerm, LeaderTerm and StandbyNodeTerm; replace the assert with a
proper early return that fails fast when no valid term exists: check if ng_term
<= 0 after those fallbacks and return an appropriate NOT_LEADER or RETRY error
code (the same error path the caller expects) instead of proceeding to call
AcquireCceKeyLock and constructing a CcEntryAddr; reference the
Sharder::Instance() calls and methods CandidateLeaderTerm, LeaderTerm,
StandbyNodeTerm and ensure the function returns the error before any use of
AcquireCceKeyLock/CcEntryAddr.
In `@tx_service/include/cc/template_cc_map.h`:
- Around line 202-203: Replace the assert-based term validation with a runtime
check: in the locations where you do "int64_t ng_term = req.NodeGroupTerm();"
(including the Execute() path and the other site around the current asserts), if
ng_term <= 0 return the same error used by ReadCc (REQUESTED_NODE_NOT_LEADER)
instead of asserting; update the function that returns the RPC/Status to
propagate REQUESTED_NODE_NOT_LEADER so callers handle it like ReadCc (see ReadCc
handling around lines ~1391-1406 for the exact behavior) and remove the
assert(ng_term > 0) usages.
In `@tx_service/include/data_sync_task.h`:
- Around line 186-189: The SetSyncOnLeader() setter and unsynchronized reads of
sync_on_leader_ in DataSyncTask can race; change the DataSyncTask member
sync_on_leader_ to std::atomic<bool> and update SetSyncOnLeader() to call
sync_on_leader_.store(sync_on_leader, std::memory_order_release) and replace all
unsynchronized reads (e.g., in finalization expressions like
task->sync_on_leader_ ? task->data_sync_ts_ : 0) with
sync_on_leader_.load(std::memory_order_acquire) so reads and writes are properly
synchronized.
In `@tx_service/src/cc/cc_req_misc.cpp`:
- Around line 1171-1199: The const method UpdateCceCkptTsCc::ValidTermCheck() is
missing the assert guard used by other const variants; add assert(term_ > 0);
near the start of ValidTermCheck (before the comparison against current_term) so
uninitialized/invalid term_ values are caught early; ensure the assertion is
placed in the body of UpdateCceCkptTsCc::ValidTermCheck() prior to the if
(current_term < 0 || current_term != term_) check and that <cassert> is
available in the translation unit.
🧹 Nitpick comments (2)
tx_service/src/cc/cc_shard.cpp (1)
1072-1075: Resolve the candidate-leader recovery TODO before merge.

Please either implement the intended candidate-term behavior here or document why recovery must remain leader-only. I can help draft the fix or open a tracking issue if needed.
tx_service/include/cc/range_cc_map.h (1)
615-618: Keep term resolution consistent for key-cache init.

This path now relies solely on `req.NodeGroupTerm()` with a hard `assert(ng_term > 0)` but doesn't reuse the candidate/leader/standby fallback from `ReadCc`. If `NodeGroupTerm` isn't populated for this path, the cache init will assert and bail. Consider reusing the same resolution helper or gracefully skipping cache init when the term is invalid.
```cpp
int64_t ng_term = req.NodeGroupTerm();
assert(ng_term > 0);
CODE_FAULT_INJECTOR("term_CatalogCcMap_Execute_PostWriteAllCc", {
```
Guard invalid NodeGroupTerm() instead of assert‑only.
If a request arrives with a negative term, this assert can crash debug builds; in release builds the negative term is used downstream. Prefer a runtime error path (optionally mirroring the ReadCc fallback) so invalid terms are rejected cleanly.
🔧 Proposed fix (robust guard + fallback)
```diff
-    int64_t ng_term = req.NodeGroupTerm();
-    assert(ng_term > 0);
+    int64_t ng_term = req.NodeGroupTerm();
+    if (ng_term < 0)
+    {
+        if (req.AllowRunOnCandidate())
+        {
+            ng_term = Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
+        }
+        if (ng_term < 0)
+        {
+            ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
+            int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
+            ng_term = std::max(ng_term, standby_node_term);
+        }
+    }
+    if (ng_term <= 0)
+    {
+        req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+        return true;
+    }
```
🤖 Prompt for AI Agents
In `@tx_service/include/cc/catalog_cc_map.h` around lines 197 - 199, Replace the
assert on ng_term with a runtime guard: check req.NodeGroupTerm() into ng_term
and if ng_term <= 0 return a proper error (or take the same fallback path used
by ReadCc) instead of crashing; update the block around ng_term,
CODE_FAULT_INJECTOR("term_CatalogCcMap_Execute_PostWriteAllCc", ...) and
subsequent logic to handle the invalid term via the error/fallback path so
downstream code never receives a negative term.
```diff
     uint32_t ng_id = req.NodeGroupId();
-    int64_t ng_term = Sharder::Instance().LeaderTerm(ng_id);
-    ng_term = std::max(ng_term, Sharder::Instance().StandbyNodeTerm());
-
-    if (req.IsInRecovering())
+    int64_t ng_term = req.NodeGroupTerm();
+    if (ng_term < 0)
     {
-        ng_term = ng_term > 0
-                      ? ng_term
-                      : Sharder::Instance().CandidateLeaderTerm(ng_id);
+        if (req.AllowRunOnCandidate())
+        {
+            ng_term = Sharder::Instance().CandidateLeaderTerm(ng_id);
+        }
+        if (ng_term < 0)
+        {
+            ng_term = Sharder::Instance().LeaderTerm(ng_id);
+            int64_t standby_node_term =
+                Sharder::Instance().StandbyNodeTerm();
+            ng_term = std::max(ng_term, standby_node_term);
+        }
     }
+    assert(ng_term > 0);
```
Handle unresolved term after fallback instead of asserting.
If the leader/standby/candidate lookup still yields a negative term, the current assert can crash and negative terms can leak into lock metadata in release builds. Return a proper error instead.
🔧 Proposed fix (runtime error on invalid term)
```diff
-    assert(ng_term > 0);
+    if (ng_term <= 0)
+    {
+        req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+        return true;
+    }
```

🤖 Prompt for AI Agents
In `@tx_service/include/cc/catalog_cc_map.h` around lines 1143 - 1160, The code
currently asserts ng_term > 0 after falling back via Sharder::Instance() calls;
replace this assertion with a runtime error path: if ng_term <= 0, log a clear
error referencing req.NodeGroupId() and the failed lookups
(NodeGroupTerm/StandbyNodeTerm/CandidateLeaderTerm), set the function's
error/Status return (or populate the response error fields used by this module)
and return early instead of proceeding to use ng_term; remove the assert(ng_term
> 0) and ensure no lock metadata is written when ng_term is invalid.
```diff
     uint32_t ng_id = req.NodeGroupId();
-    int64_t ng_term = Sharder::Instance().LeaderTerm(ng_id);
-
-    if (ng_term < 0)
-    {
-        hd_res->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
-        return true;
-    }
+    int64_t ng_term = req.NodeGroupTerm();
+    assert(ng_term > 0);
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's find and examine the file and context around lines 69-71
fd -e h -e cc -e cpp "cluster_config_cc_map" | head -5
```

Repository: eloqdata/tx_service

🏁 Script executed:

```shell
# Search for ValidTermCheck usage and NodeGroupTerm patterns
rg -n "ValidTermCheck" --max-count=20 | head -30
```

🏁 Script executed:

```shell
# Search for NodeGroupTerm usage and see the context
rg -n "NodeGroupTerm" -B2 -A2 --max-count=30 | head -50
```

🏁 Script executed:

```shell
# Look for AcquireAllCc method and surrounding validation
rg -n "AcquireAllCc" -B5 -A15 | head -60
```

🏁 Script executed:

```shell
# Read the cluster_config_cc_map.h file around lines 69-71 with broader context
cat tx_service/include/cc/cluster_config_cc_map.h | head -100
```

🏁 Script executed:

```shell
# Search for TryPinNodeGroupData usage in cluster_config_cc_map.h
rg -n "TryPinNodeGroupData" tx_service/include/cc/cluster_config_cc_map.h -B5 -A5
```

🏁 Script executed:

```shell
# Check if there's any validation on NodeGroupTerm in the request class
rg -n "class.*AcquireAllCc\|struct.*AcquireAllCc" -A30 | head -80
```

🏁 Script executed:

```shell
# Look for where NodeGroupTerm is set/validated in AcquireAll request
rg -n "NodeGroupTerm\(\)" -B10 -A2 tx_service/include/cc/cluster_config_cc_map.h | head -50
```

🏁 Script executed:

```shell
# Read more of the Execute method to see how ng_term is used
sed -n '61,150p' tx_service/include/cc/cluster_config_cc_map.h
```

🏁 Script executed:

```shell
# Check if AcquireAllCc has ValidTermCheck method
rg -n "class AcquireAllCc\|struct AcquireAllCc" -A40 | head -100
```

🏁 Script executed:

```shell
# Search for where NodeGroupTerm is set/populated in the request creation
rg -n "NodeGroupTerm\(" --max-count=20 | grep -E "Set|Reset|="
```

🏁 Script executed:

```shell
# Check remote_cc_request.h for RemoteAcquireAll::Reset and ValidTermCheck
sed -n '85,200p' tx_service/include/remote/remote_cc_request.h
```

🏁 Script executed:

```shell
# Search for where ClusterConfigCcMap::Execute is called and what validation happens before
rg -n "ClusterConfigCcMap.*Execute\|cluster_config_ccm_\|cluster_config_ccm\|AcquireAllCc" tx_service/src/cc/local_cc_handler.cpp -B5 -A5 | head -100
```

🏁 Script executed:

```shell
# Check the critical finding: template_cc_map.h line 2379 checks ng_term < 0
sed -n '2375,2385p' tx_service/include/cc/template_cc_map.h
```

🏁 Script executed:

```shell
# Check what happens when term < 0 in other methods
rg -n "if.*ng_term.*<.*0\|if.*term.*<.*0" tx_service/include/cc/*.h | head -30
```

🏁 Script executed:

```shell
# Search for AcquireAllCc class definition to see if ValidTermCheck exists
rg -n "class AcquireAllCc\|struct AcquireAllCc" tx_service/include/cc/cc_request.h -A50 | head -100
```

🏁 Script executed:

```shell
# Check if NodeGroupTerm() comes from the protobuf request message and what values it can have
rg -n "NodeGroupTerm\(\)" tx_service/include/cc/cc_request.h -B2 -A5 | head -40
```

🏁 Script executed:

```shell
# Search for the base class AcquireAllCc and its NodeGroupTerm implementation
rg -n "class AcquireAllCc" tx_service/include/cc/cc_request.h -A80 | head -120
```

🏁 Script executed:

```shell
# Check similar patterns in other cc_maps for comparison
rg -n "req\.NodeGroupTerm\(\)" tx_service/include/cc/*.h -B1 -A1 | grep -E "assert|if.*<|if.*error" | head -20
```

🏁 Script executed:

```shell
# Get the full AcquireAllCc class definition
rg -n "struct AcquireAllCc\|class AcquireAllCc" tx_service/include/cc/cc_request.h -A100 | head -150
```

🏁 Script executed:

```shell
# Check if the NodeGroupTerm_ is set from protobuf message without validation
rg -n "ng_term_\s*=" tx_service/include/cc/cc_request.h -B2 -A2 | head -40
```

🏁 Script executed:

```shell
# Verify: Does Sharder::Instance().LeaderTerm() return negative on invalid state?
rg -n "LeaderTerm" tx_service/src/cc/local_cc_shards.cpp -B3 -A10 | head -60
```
Replace assert with graceful term validation to match other cc_map implementations.
This assert will hard-crash if ng_term is ≤ 0, but the codebase shows ng_term can legitimately be negative: (1) ng_term_ is initialized to -1 in cc_request.h and conditionally patched later, (2) Sharder::Instance().LeaderTerm() can return negative values (see local_cc_shards.cpp:1594), and (3) other cc_maps like catalog_cc_map.h, range_bucket_cc_map.h, and range_cc_map.h use graceful if (ng_term < 0) checks instead. Replace the assert with a proper validation check that returns REQUESTED_NODE_NOT_LEADER on invalid term, consistent with the pattern used elsewhere in the codebase.
🤖 Prompt for AI Agents
In `@tx_service/include/cc/cluster_config_cc_map.h` around lines 69 - 71, The
assert on ng_term should be replaced with a graceful validation like other
cc_map implementations: after reading uint32_t ng_id = req.NodeGroupId();
int64_t ng_term = req.NodeGroupTerm(); remove assert(ng_term > 0) and instead
check if (ng_term < 0) and return REQUESTED_NODE_NOT_LEADER (or the function's
error return consistent with catalog_cc_map.h / range_cc_map.h patterns); ensure
you reference ng_term and REQUESTED_NODE_NOT_LEADER so the function exits
cleanly when term is invalid rather than crashing.
```cpp
CcReqStatus PostRead(uint64_t tx_number,
                     int64_t tx_term,
                     uint16_t command_id,
                     uint64_t key_ts,
                     uint64_t gap_ts,
                     uint64_t commit_ts,
                     const CcEntryAddr &ccentry_addr,
                     CcHandlerResult<PostProcessResult> &hres,
                     bool is_local = false,
                     bool need_remote_resp = true,
                     PostReadType post_read_type = PostReadType::Release,
                     bool allow_run_on_candidate = false) override;
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Locate RemoteCcHandler::PostRead and candidate flag handling
rg -n --type=cpp 'RemoteCcHandler::PostRead' -A 15
rg -n --type=cpp 'allow_run_on_candidate' --type=cpp
```

🏁 Script executed:

```shell
#!/bin/bash
# Check the header file for all PostRead overloads in RemoteCcHandler
echo "=== RemoteCcHandler header (remote_cc_handler.h) ==="
rg -n -B2 -A12 'class RemoteCcHandler' tx_service/include/remote/remote_cc_handler.h | head -50
echo ""
echo "=== All PostRead declarations in header ==="
rg -n -B2 -A10 'PostRead' tx_service/include/remote/remote_cc_handler.h
echo ""
echo "=== Check line 329 and context in remote_cc_handler.cpp ==="
sed -n '320,395p' tx_service/src/remote/remote_cc_handler.cpp
```

🏁 Script executed:

```shell
#!/bin/bash
echo "=== Full RemoteCcHandler::PostRead implementation ==="
sed -n '272,305p' tx_service/src/remote/remote_cc_handler.cpp
echo ""
echo "=== LocalCcHandler::PostRead signature from header ==="
sed -n '110,121p' tx_service/include/cc/local_cc_handler.h
echo ""
echo "=== Check if there are other PostRead overloads in remote handler ==="
rg -n 'PostRead' tx_service/include/remote/remote_cc_handler.h
```
Remote PostRead must propagate the allow_run_on_candidate parameter.
The local PostRead signature now includes allow_run_on_candidate, but the remote PostRead (declared in tx_service/include/remote/remote_cc_handler.h lines 103-114) is missing this parameter. The implementation does not set it on the ValidateRequest message. This means candidate-enabled PostReads routed remotely silently drop the flag, potentially allowing invalid operations on candidate nodes. Update the remote PostRead signature and RPC message plumbing to match the local API.
🤖 Prompt for AI Agents
In `@tx_service/include/cc/local_cc_handler.h` around lines 110 - 121, The remote
PostRead declaration and plumbing must be updated to accept and forward the
allow_run_on_candidate flag: modify the remote handler's PostRead signature in
remote_cc_handler.h to include the bool allow_run_on_candidate parameter
(matching local_cc_handler.h), update the implementation that constructs the
ValidateRequest RPC message to set the corresponding field on ValidateRequest,
and propagate the flag through any intermediate call sites and RPC stubs (e.g.,
methods that build/send ValidateRequest and the RPC handler) so remote PostRead
honors allow_run_on_candidate. Ensure symbols updated include PostRead,
allow_run_on_candidate, and ValidateRequest.
```cpp
bool GetNextRangePartitionId(const TableName &tablename,
                             const TableSchema *table_schema,
                             NodeGroupId ng_id,
                             uint32_t range_cnt,
                             int32_t &out_next_partition_id);
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, verify the function signature at the specified location
echo "=== Function signature at lines 2450-2454 ==="
sed -n '2450,2454p' tx_service/include/cc/local_cc_shards.h
echo -e "\n=== All occurrences of GetNextRangePartitionId ==="
rg -n -C3 'GetNextRangePartitionId\s*\(' --no-headingRepository: eloqdata/tx_service
Length of output: 3970
🏁 Script executed:
#!/bin/bash
# Check if GetNextRangePartitionId is a virtual method in a base class
echo "=== Search for base class or interface definitions ==="
rg -n 'class.*Handler.*{' store_handler/ -A 5
echo -e "\n=== Check for virtual keyword in declarations ==="
rg -n 'virtual.*GetNextRangePartitionId' store_handler/
echo -e "\n=== Check bigtable_handler.h for method signatures ==="
sed -n '220,230p' store_handler/bigtable_handler.h
echo -e "\n=== Check dynamo_handler.h for method signatures ==="
sed -n '165,175p' store_handler/dynamo_handler.hRepository: eloqdata/tx_service
Length of output: 1364
🏁 Script executed:
#!/bin/bash
# Find base class more simply
echo "=== Looking for StoreHandler or base class interface ==="
fd -n "\.h$" store_handler/ | head -5 | xargs -I {} sh -c 'echo "=== {} ==="; head -30 {}'
echo -e "\n=== Search for GetNextRangePartitionId in base/interface files ==="
rg -n 'GetNextRangePartitionId' store_handler/ -B 2 -A 1
echo -e "\n=== Check if there's a StoreHandler base class ==="
fd -n "store_handler" --maxdepth 1 -type f -name "*.h" -exec grep -l "class.*StoreHandler" {} \;Repository: eloqdata/tx_service
Length of output: 1960
🏁 Script executed:
#!/bin/bash
# Find base class definition
echo "=== List store_handler header files ==="
find store_handler -maxdepth 1 -type f -name "*.h"
echo -e "\n=== Search for class declarations in store_handler ==="
rg 'class.*{' store_handler/ -m 1
echo -e "\n=== Check bigtable_handler.h class declaration ==="
head -50 store_handler/bigtable_handler.h | grep -A 20 'class'
echo -e "\n=== Check dynamo_handler.h class declaration ==="
head -50 store_handler/dynamo_handler.h | grep -A 20 'class'
echo -e "\n=== Look for store_handler base/interface ==="
find store_handler -name "*.h" | xargs grep -l 'virtual.*GetNextRangePartitionId'Repository: eloqdata/tx_service
Length of output: 1262
🏁 Script executed:
#!/bin/bash
# Find DataStoreHandler base class definition
echo "=== Search for DataStoreHandler class definition ==="
rg 'class DataStoreHandler' -A 30
echo -e "\n=== Look for GetNextRangePartitionId in base class ==="
rg 'DataStoreHandler' -l | head -5
echo -e "\n=== Check store directory or include files ==="
find . -path ./tx_service -prune -o -name "*.h" -type f -exec grep -l 'DataStoreHandler' {} \; | head -10
Length of output: 10061
🏁 Script executed:
#!/bin/bash
# Find GetNextRangePartitionId in the base class definition
echo "=== Extract GetNextRangePartitionId from base class ==="
rg -A 10 'GetNextRangePartitionId' tx_service/include/store/data_store_handler.h
echo -e "\n=== Summary of all signatures ==="
echo "LocalCcShards (tx_service/include/cc/local_cc_shards.h:2450):"
sed -n '2450,2454p' tx_service/include/cc/local_cc_shards.h
echo -e "\nBigTableHandler (store_handler/bigtable_handler.h:222):"
sed -n '222,225p' store_handler/bigtable_handler.h
echo -e "\nDynamoHandler (store_handler/dynamo_handler.h:168):"
sed -n '168,171p' store_handler/dynamo_handler.h
Length of output: 1126
🏁 Script executed:
#!/bin/bash
# Search entire data_store_handler.h file for GetNextRangePartitionId
echo "=== Full content of data_store_handler.h ==="
wc -l tx_service/include/store/data_store_handler.h
echo -e "\n=== Search for GetNextRangePartitionId ==="
grep -n "GetNextRangePartitionId" tx_service/include/store/data_store_handler.h
echo -e "\n=== Check if DataStoreHandler has any method named GetNext ==="
grep -n "GetNext" tx_service/include/store/data_store_handler.h
echo -e "\n=== Get entire DataStoreHandler class definition ==="
sed -n '/^class DataStoreHandler/,/^}/p' tx_service/include/store/data_store_handler.h | head -150
Length of output: 5725
🏁 Script executed:
#!/bin/bash
# Verify that GetNextRangePartitionId is not in the base class
echo "=== Full DataStoreHandler file length ==="
wc -l tx_service/include/store/data_store_handler.h
echo -e "\n=== Search entire file for GetNextRangePartitionId ==="
grep -c "GetNextRangePartitionId" tx_service/include/store/data_store_handler.h
echo -e "\n=== All usages of GetNextRangePartitionId in codebase ==="
rg 'GetNextRangePartitionId' --no-heading -l
echo -e "\n=== Check if handler methods are actually called ==="
rg 'handler.*GetNextRangePartitionId|GetNextRangePartitionId.*handler' -C 3
echo -e "\n=== Verify LocalCcShards call site ==="
sed -n '5648,5658p' tx_service/src/cc/local_cc_shards.cpp
Length of output: 1035
Fix compilation errors: override keyword on non-existent base class methods.
BigTableHandler and DynamoHandler declare GetNextRangePartitionId with the override keyword, but this method does not exist in the base class DataStoreHandler. This causes a compilation error. Either remove the override keyword or define the method in the base class.
Additionally, the handler signatures are incompatible with LocalCcShards::GetNextRangePartitionId:
- LocalCcShards: `GetNextRangePartitionId(tablename, table_schema, ng_id, range_cnt, out_partition_id)`
- Handlers: `GetNextRangePartitionId(tablename, range_cnt, out_partition_id, retry_count)`

The handlers are missing the table_schema and ng_id parameters while adding retry_count. Reconcile the signatures across all implementations.
🤖 Prompt for AI Agents
In `@tx_service/include/cc/local_cc_shards.h` around lines 2450 - 2454, The
compilation fails because BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId use override but no matching virtual
exists in DataStoreHandler, and their parameter lists differ from
LocalCcShards::GetNextRangePartitionId; fix by adding a virtual declaration in
DataStoreHandler with the canonical signature used by LocalCcShards — bool
GetNextRangePartitionId(const TableName &tablename, const TableSchema
*table_schema, NodeGroupId ng_id, uint32_t range_cnt, int32_t
&out_next_partition_id) — or alternatively remove override from the handlers;
then update BigTableHandler::GetNextRangePartitionId and
DynamoHandler::GetNextRangePartitionId to match that canonical parameter list
(add table_schema and ng_id, remove retry_count) so all declarations/definitions
are consistent.
```diff
  int64_t ng_term = req.NodeGroupTerm();
  if (ng_term < 0)
  {
-     req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
-     return true;
+     if (req.AllowRunOnCandidate())
+     {
+         ng_term =
+             Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
+     }
+     if (ng_term < 0)
+     {
+         ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
+         int64_t standby_node_term =
+             Sharder::Instance().StandbyNodeTerm();
+         ng_term = std::max(ng_term, standby_node_term);
+     }
  }
+ assert(ng_term > 0);
```
Avoid assert‑only handling for unresolved ng_term.
If all fallbacks still yield a negative term, the assert can crash or (in release) proceed with invalid metadata. Prefer a runtime error.
🔧 Proposed fix (runtime error on invalid term)
```diff
- assert(ng_term > 0);
+ if (ng_term <= 0)
+ {
+     req.Result()->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+     return true;
+ }
```

🤖 Prompt for AI Agents
In `@tx_service/include/cc/range_bucket_cc_map.h` around lines 92 - 108, The code
currently relies on assert(ng_term > 0) after trying req.NodeGroupTerm(),
CandidateLeaderTerm(), LeaderTerm() and StandbyNodeTerm(); replace this assert
with a runtime error path: after computing ng_term check if ng_term <= 0 and if
so log or construct a descriptive error including req.NodeGroupId(), the
original req.NodeGroupTerm(), CandidateLeaderTerm(...), LeaderTerm(...), and
StandbyNodeTerm(), then throw a std::runtime_error (or return a failure status
according to your error model) instead of asserting so callers receive a
deterministic error rather than crashing or continuing with invalid metadata.
```diff
  int64_t ng_term = req.NodeGroupTerm();
  if (ng_term < 0)
  {
-     hd_result->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
-     return true;
+     if (req.AllowRunOnCandidate())
+     {
+         ng_term =
+             Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
+     }
+     if (ng_term < 0)
+     {
+         ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
+         int64_t standby_node_term =
+             Sharder::Instance().StandbyNodeTerm();
+         ng_term = std::max(ng_term, standby_node_term);
+     }
  }
+ assert(ng_term > 0);
```
Handle missing node‑group term without relying on assert
Lines 220‑236: after the candidate/leader/standby fallback, ng_term can still be < 0 when no valid term exists (e.g., during elections). In non‑assert builds this value flows into AcquireCceKeyLock and the returned CcEntryAddr, risking invalid lock ownership. Prefer returning a NOT_LEADER/RETRY error before proceeding.
💡 Proposed guard to fail fast when no valid term is available
```diff
  if (ng_term < 0)
  {
      if (req.AllowRunOnCandidate())
      {
          ng_term =
              Sharder::Instance().CandidateLeaderTerm(req.NodeGroupId());
      }
      if (ng_term < 0)
      {
          ng_term = Sharder::Instance().LeaderTerm(req.NodeGroupId());
          int64_t standby_node_term =
              Sharder::Instance().StandbyNodeTerm();
          ng_term = std::max(ng_term, standby_node_term);
      }
  }
- assert(ng_term > 0);
+ if (ng_term < 0)
+ {
+     hd_result->SetError(CcErrorCode::REQUESTED_NODE_NOT_LEADER);
+     hd_result->SetFinished();
+     return true;
+ }
+ assert(ng_term > 0);
```

🤖 Prompt for AI Agents
In `@tx_service/include/cc/range_cc_map.h` around lines 220 - 236, The code
currently asserts ng_term > 0 after trying CandidateLeaderTerm, LeaderTerm and
StandbyNodeTerm; replace the assert with a proper early return that fails fast
when no valid term exists: check if ng_term <= 0 after those fallbacks and
return an appropriate NOT_LEADER or RETRY error code (the same error path the
caller expects) instead of proceeding to call AcquireCceKeyLock and constructing
a CcEntryAddr; reference the Sharder::Instance() calls and methods
CandidateLeaderTerm, LeaderTerm, StandbyNodeTerm and ensure the function returns
the error before any use of AcquireCceKeyLock/CcEntryAddr.
```cpp
int64_t ng_term = req.NodeGroupTerm();
assert(ng_term > 0);
```
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the actual code at the specified lines
cat -n tx_service/include/cc/template_cc_map.h | sed -n '200,210p'
cat -n tx_service/include/cc/template_cc_map.h | sed -n '690,705p'
Length of output: 1259
🏁 Script executed:
# Search for ValidTermCheck function definition and usage
rg -n "ValidTermCheck" -A 5 tx_service/
Length of output: 23652
🏁 Script executed:
# Find the broader context of AcquireCc and AcquireAllCc to understand the term lifecycle
rg -n "void AcquireCc\|void AcquireAllCc" -A 20 tx_service/include/cc/template_cc_map.h
Length of output: 45
🏁 Script executed:
# Check for callers of AcquireCc to understand where NodeGroupTerm comes from
rg -n "\.AcquireCc\(|\.AcquireAllCc\(" -B 5 -A 2 tx_service/
Length of output: 45
🏁 Script executed:
# Find the Execute method that contains line 202-203
sed -n '175,250p' tx_service/include/cc/template_cc_map.h
Length of output: 3430
🏁 Script executed:
# Look for the request class definition and its entry point
rg -n "class.*Acquire.*Cc" tx_service/include/cc/template_cc_map.h | head -20
Length of output: 45
🏁 Script executed:
# Check if there's a base class Execute that calls ValidTermCheck before AcquireCc
rg -n "bool Execute" tx_service/include/cc/template_cc_map.h | head -10
Length of output: 566
🏁 Script executed:
# Search for how requests are dispatched to AcquireCc - look at request lifecycle
rg -n "req.NodeGroupTerm()" tx_service/include/cc/template_cc_map.h -B 5 -A 10 | head -80
Length of output: 2864
🏁 Script executed:
# Look at base request class to see if there's pre-Execute validation
sed -n '115,145p' tx_service/include/cc/cc_request.h
Length of output: 755
🏁 Script executed:
# Check if Execute() is called directly or through a wrapper that validates
rg -n "ValidTermCheck" tx_service/include/cc/template_cc_map.h
Length of output: 45
🏁 Script executed:
# Search for how Execute() is invoked - look for dispatch patterns
rg -n "\.Execute\(" tx_service/include/cc/ -B 2 | grep -A 2 "AcquireCc\|AcquireAllCc" | head -40
Length of output: 45
🏁 Script executed:
# Check the CcRequest base class wrapper
rg -n "class.*AcquireCc\b" tx_service/include/cc/cc_request.h -A 30 | head -50
Length of output: 45
🏁 Script executed:
# Get more context from the base class wrapper to see exact flow
sed -n '130,165p' tx_service/include/cc/cc_request.h
Length of output: 1244
🏁 Script executed:
# Check what NodeGroupTerm() returns - look for its definition
rg -n "NodeGroupTerm()" tx_service/include/cc/ --type-list | head -5
Length of output: 452
🏁 Script executed:
# Find where NodeGroupTerm is set on the request object
rg -n "SetNodeGroupTerm|NodeGroupTerm_|ng_term_" tx_service/include/cc/cc_request.h | head -20
Length of output: 980
🏁 Script executed:
# Verify: does NodeGroupTerm() return the validated ng_term_ or something else?
rg -n "int64_t NodeGroupTerm()" tx_service/include/cc/ -A 3
Length of output: 1491
🏁 Script executed:
# Look at how the base class wrapper invokes the derived Execute
sed -n '145,165p' tx_service/include/cc/cc_request.h
Length of output: 958
🏁 Script executed:
# Verify the flow: TemplatedCcRequest::Execute calls ValidTermCheck, then invokes typed_req->Execute
rg -n "typed_req->Execute" tx_service/include/cc/cc_request.h -B 5 -A 2
Length of output: 45
🏁 Script executed:
# Look at PostWriteCc Execute to see if it uses runtime guards like ReadCc
sed -n '478,520p' tx_service/include/cc/template_cc_map.h
Length of output: 1631
🏁 Script executed:
# Confirm the exact flow: does TemplatedCcRequest::Execute call ValidTermCheck before invoking typed_req->Execute?
sed -n '136,180p' tx_service/include/cc/cc_request.h
Length of output: 2111
🏁 Script executed:
# Check what happens at the call site - is typed_req->Execute called after ValidTermCheck returns true?
sed -n '136,170p' tx_service/include/cc/cc_request.h | tail -35
Length of output: 1418
🏁 Script executed:
# Now check if AcquireCc is overriding or if it's directly called
# Look for class AcquireCc definition
rg -n "^class AcquireCc" tx_service/include/cc/ -A 10
Length of output: 45
🏁 Script executed:
# Critical: Check if req.NodeGroupTerm() can change between ValidTermCheck and Execute
# Look at what NodeGroupTerm represents - is it a volatile value or cached?
rg -n "ng_term_\s*=" tx_service/include/cc/cc_request.h | head -20
Length of output: 889
🏁 Script executed:
# Confirm: Does ValidTermCheck set ng_term_ for ALL request types including AcquireCc?
rg -n "class AcquireCc\b" tx_service/include/cc/cc_request.h -A 40 | head -50
Length of output: 45
🏁 Script executed:
# Check the fault injector test more carefully - it sets ng_term = -1 AFTER ValidTermCheck
# This proves ng_term can become invalid even after validation
sed -n '200,220p' tx_service/include/cc/template_cc_map.h
Length of output: 795
🏁 Script executed:
# Check if there are any other places where ng_term is reassigned after ValidTermCheck
rg -n "ng_term\s*=" tx_service/include/cc/template_cc_map.h | grep -A 5 -B 5 "202\|696"
Length of output: 507
🏁 Script executed:
# Compare consistency: PostWriteCc doesn't have assert for ng_term. Check what it does.
rg -n "ng_term" tx_service/include/cc/template_cc_map.h | head -30
Length of output: 1583
🏁 Script executed:
# Final check: look at the learning pattern - unpinning should only happen if term >= 0
# Does AcquireCc or AcquireAllCc call any pin/unpin that should be guarded?
rg -n "TryPinNodeGroupData\|UnpinNodeGroupData" tx_service/include/cc/template_cc_map.h
Length of output: 45
Add runtime guard for invalid ng_term instead of relying on assert.
Lines 202-203 and 695-697 validate the term with assertions, which are compiled out in release builds and therefore catch nothing there. The fault injector test (lines 204-209) explicitly demonstrates that ng_term can become invalid within Execute(). Use a runtime check like ReadCc does (lines 1391-1406): if ng_term <= 0, return a REQUESTED_NODE_NOT_LEADER error rather than proceeding. This aligns with the codebase pattern for term safety and matches the fault injector's expected error handling.
🤖 Prompt for AI Agents
In `@tx_service/include/cc/template_cc_map.h` around lines 202 - 203, Replace the
assert-based term validation with a runtime check: in the locations where you do
"int64_t ng_term = req.NodeGroupTerm();" (including the Execute() path and the
other site around the current asserts), if ng_term <= 0 return the same error
used by ReadCc (REQUESTED_NODE_NOT_LEADER) instead of asserting; update the
function that returns the RPC/Status to propagate REQUESTED_NODE_NOT_LEADER so
callers handle it like ReadCc (see ReadCc handling around lines ~1391-1406 for
the exact behavior) and remove the assert(ng_term > 0) usages.
tx_service/include/data_sync_task.h
Outdated
```cpp
void SetSyncOnLeader(bool sync_on_leader)
{
    sync_on_leader_ = sync_on_leader;
}
```
🧩 Analysis chain
🏁 Script executed:
cd tx_service && find . -name "data_sync_task.h" -type f
Length of output: 89
🏁 Script executed:
wc -l tx_service/include/data_sync_task.h
Length of output: 102
🏁 Script executed:
sed -n '1,100p' tx_service/include/data_sync_task.h | cat -n
Length of output: 4097
🏁 Script executed:
sed -n '180,250p' tx_service/include/data_sync_task.h | cat -n
Length of output: 2728
🏁 Script executed:
rg "SetSyncOnLeader" --type cpp --type h -B 2 -A 2
Length of output: 1058
🏁 Script executed:
rg "sync_on_leader_" --type cpp --type h -B 2 -A 2
Length of output: 3431
🏁 Script executed:
sed -n '240,250p' tx_service/include/data_sync_task.h | cat -n
Length of output: 403
🏁 Script executed:
rg "SetSyncOnLeader" --type cpp -B 5 -A 5
Length of output: 2508
🏁 Script executed:
rg "sync_on_leader_.*?" --type cpp -B 3 -A 3 | head -100
Length of output: 4643
🏁 Script executed:
rg "SetSyncOnLeader" --type cpp -B 15 -A 5
Length of output: 4562
🏁 Script executed:
rg "sync_on_leader_" --type cpp -B 5 -A 5 | head -150
Length of output: 7268
🏁 Script executed:
rg "std::shared_ptr.*DataSyncTask" --type cpp --type h -A 3 | head -80
Length of output: 5275
🏁 Script executed:
rg "DataSyncForRangePartition|DataSyncForHashPartition" --type cpp -B 3 -A 20 | head -150
Length of output: 10230
🏁 Script executed:
grep -n "SetSyncOnLeader" tx_service/src/cc/local_cc_shards.cpp
Length of output: 176
🏁 Script executed:
sed -n '1,50p' tx_service/src/cc/local_cc_shards.cpp | head -50
Length of output: 1573
🏁 Script executed:
rg "data_sync_task_queue_" --type cpp --type h -B 2 -A 2 | head -80
Length of output: 5807
🏁 Script executed:
sed -n '3620,3650p' tx_service/src/cc/local_cc_shards.cpp | cat -n
Length of output: 1581
🏁 Script executed:
sed -n '4680,4710p' tx_service/src/cc/local_cc_shards.cpp | cat -n
Length of output: 1596
🏁 Script executed:
rg "void.*DataSyncForRangePartition" --type cpp -A 30 | head -80
Length of output: 5032
🏁 Script executed:
sed -n '180,250p' tx_service/include/data_sync_task.h | cat -n
Length of output: 2728
🏁 Script executed:
rg "void.*Sync.*Adjustable|void.*Set.*OnLeader" --type h --type cpp -B 2 -A 5
Length of output: 754
🏁 Script executed:
rg "last_sync_ts.*=.*sync_on_leader_" --type cpp -B 10 | head -100
Length of output: 911
🏁 Script executed:
rg "PostProcessFlushTaskEntries|FinishDataSyncTask" --type cpp --type h -B 2 -A 10 | head -120
Length of output: 2966
🏁 Script executed:
rg "FinishDataSync|SetScanTaskFinished|SetFinish" --type cpp -B 3 -A 8 | head -100
Length of output: 5644
🏁 Script executed:
rg "void.*Worker\(\)" --type cpp --type h | grep -i "data\|sync\|flush" | head -20
Length of output: 898
🏁 Script executed:
sed -n '4700,4750p' tx_service/src/cc/local_cc_shards.cpp | cat -n
Length of output: 2323
🏁 Script executed:
rg "sync_on_leader_.*\?" --type cpp -B 2 -A 5 | head -80
Length of output: 2741
🏁 Script executed:
rg "need_update_ckpt_ts_" --type cpp --type h -B 2 -A 2
Length of output: 1370
Guard sync_on_leader_ updates against data races.
The non-atomic write in SetSyncOnLeader() (lines 186-189) can race with reads in finalization code paths. While SetSyncOnLeader() is called during worker thread execution in DataSyncForRangePartition(), the DataSyncTask object is shared via shared_ptr across threads. The reads of sync_on_leader_ in finalization (e.g., task->sync_on_leader_ ? task->data_sync_ts_ : 0) occur without synchronization.
The same issue applies to the member declaration at line 248 (the review range also covers lines 246-247), where sync_on_leader_ is a plain bool with no atomic or mutex protection. Use std::atomic&lt;bool&gt; or guard access with a mutex to ensure thread safety.
🤖 Prompt for AI Agents
In `@tx_service/include/data_sync_task.h` around lines 186 - 189, The
SetSyncOnLeader() setter and unsynchronized reads of sync_on_leader_ in
DataSyncTask can race; change the DataSyncTask member sync_on_leader_ to
std::atomic<bool> and update SetSyncOnLeader() to call
sync_on_leader_.store(sync_on_leader, std::memory_order_release) and replace all
unsynchronized reads (e.g., in finalization expressions like
task->sync_on_leader_ ? task->data_sync_ts_ : 0) with
sync_on_leader_.load(std::memory_order_acquire) so reads and writes are properly
synchronized.
```cpp
bool UpdateCceCkptTsCc::ValidTermCheck() const
{
    int64_t ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
    int64_t candidate_ng_term =
        Sharder::Instance().CandidateLeaderTerm(node_group_id_);
    ng_term = std::max(ng_term, candidate_ng_term);
    int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
    int64_t current_term = std::max(ng_term, standby_node_term);

    if (current_term < 0 || current_term != term_)
    {
        LOG(INFO)
            << "UpdateCceCkptTsCc::ValidTermCheck failed with current term: "
            << current_term << ", term_: " << term_;
        return false;
    }

    return true;
}

bool UpdateCceCkptTsCc::Execute(CcShard &ccs)
{
    if (!ValidTermCheck())
    {
        LOG(INFO) << "UpdateCceCkptTsCc::ValidTermCheck failed on shard: "
                  << ccs.core_id_;
        SetFinished();
        return false;
    }
```
🧩 Analysis chain
🏁 Script executed:
# First, let's find the UpdateCceCkptTsCc class definition and understand term_ initialization
cd tx_service && git ls-files | grep -E "\.(h|hpp|cpp)$" | head -20
Length of output: 593
🏁 Script executed:
# Look at the header file for cc_req_misc to understand the class structure
rg -A 30 "class UpdateCceCkptTsCc" --type cpp --type h
Length of output: 45
🏁 Script executed:
# Find other ValidTermCheck implementations to understand the pattern
rg "ValidTermCheck\(" --type cpp --type h -B 2 -A 10
Length of output: 44362
🏁 Script executed:
# Search for callers of UpdateCceCkptTsCc constructor to see how term_ is set
rg "UpdateCceCkptTsCc" --type cpp --type h -B 3 -A 3
Length of output: 5534
🏁 Script executed:
# Find the full UpdateCceCkptTsCc constructor definition
rg -A 20 "UpdateCceCkptTsCc\(" tx_service/include/cc/cc_req_misc.h | head -40
Length of output: 1131
🏁 Script executed:
# Search for where UpdateCceCkptTsCc is constructed
rg "UpdateCceCkptTsCc\s*\(" --type cpp --type h -B 2 -A 5
Length of output: 1091
🏁 Script executed:
# Check the construction site more carefully
rg -B 5 -A 10 "UpdateCceCkptTsCc update_cce_req" --type cpp
Length of output: 1336
🏁 Script executed:
# Check data_sync_task structure for node_group_term_ initialization
rg -A 30 "struct DataSyncTask" tx_service/include/data_sync_task.h | head -50
Length of output: 1647
🏁 Script executed:
# Look for where node_group_term_ is assigned in data_sync_task
rg "node_group_term_" tx_service/include/data_sync_task.h -B 2 -A 2
Length of output: 616
🏁 Script executed:
# Check if there's documentation or patterns showing when node_group_term_ is set to -1 vs positive values
rg "node_group_term_.*=" tx_service/include/data_sync_task.h -B 3 -A 3
Length of output: 45
🏁 Script executed:
# Look for where DataSyncTask is instantiated to see what ng_term is passed
rg "DataSyncTask\(" --type cpp -B 3 -A 8 | head -60
Length of output: 4315
🏁 Script executed:
# Check other const ValidTermCheck implementations to understand the pattern better
rg -A 8 "bool ValidTermCheck\(\) const" --type h
Length of output: 1317
🏁 Script executed:
# Check if DataSyncTask constructor validates ng_term > 0
rg -A 50 "DataSyncTask::DataSyncTask" tx_service/src/cc/local_cc_shards.cpp | head -80
Length of output: 45
🏁 Script executed:
# Check calls to EnqueueRangeDataSyncTask and EnqueueHashPartitionDataSyncTask to see what ng_term is passed
rg "EnqueueRangeDataSyncTask\|EnqueueHashPartitionDataSyncTask" --type cpp -B 3 -A 3 | head -100
Length of output: 45
🏁 Script executed:
# Specifically search for assertions or checks on node_group_term_ or ng_term in data_sync_task initialization
rg "assert.*ng_term\|assert.*node_group_term_" --type cpp --type h -B 2 -A 2
Length of output: 45
🏁 Script executed:
# Look at the context where UpdateCceCkptTsCc is constructed more carefully
rg -B 20 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | tail -30
Length of output: 963
🏁 Script executed:
# Check the data_sync_task that's being passed
rg -B 50 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | grep -A 50 "if (cce_entries_map.size" | head -20
Length of output: 182
🏁 Script executed:
# Look for the actual PostProcessFlushTaskEntries function context
rg -B 10 -A 30 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | head -60
Length of output: 1423
🏁 Script executed:
# Get more context of PostProcessFlushTaskEntries to see how ng_term is handled
rg -A 80 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | head -100
Length of output: 3355
🏁 Script executed:
# Look for where UpdateCceCkptTsCc is enqueued with this ng_term
rg -A 150 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | grep -A 30 "UpdateCceCkptTsCc"
Length of output: 45
🏁 Script executed:
# Get the full PostProcessFlushTaskEntries function
rg -A 250 "void LocalCcShards::PostProcessFlushTaskEntries" tx_service/src/cc/local_cc_shards.cpp | tail -150
Length of output: 5846
🏁 Script executed:
# Search more specifically for UpdateCceCkptTsCc in the function
sed -n '2949,3188p' tx_service/src/cc/local_cc_shards.cpp
Length of output: 8890
🏁 Script executed:
# Let's find where UpdateCceCkptTsCc is instantiated to see what term values are passed
rg -B 10 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | head -25
Length of output: 545
🏁 Script executed:
# Check if there's any code that could create a DataSyncTask with a -1 term
rg "node_group_term_\s*=" tx_service/src/cc/local_cc_shards.cpp | head -20
Length of output: 45
🏁 Script executed:
# Check term_ field in UpdateCceCkptTsCc to see if there's any assertion or check
rg -B 5 -A 15 "struct UpdateCceCkptTsCc" tx_service/include/cc/cc_req_misc.h | tail -30
Length of output: 547
🏁 Script executed:
# Get the complete instantiation of UpdateCceCkptTsCc with all parameters
rg -A 4 "UpdateCceCkptTsCc update_cce_req\(" tx_service/src/cc/local_cc_shards.cpp
Length of output: 324
🏁 Script executed:
# Check if the data_sync_task is guaranteed to have a positive term by searching how it's constructed
rg -B 20 "EnqueueRangeDataSyncTask" tx_service/src/cc/local_cc_shards.cpp | grep -A 20 "DataSyncTask"
Length of output: 1334
🏁 Script executed:
# Look for all places where UpdateCceCkptTsCc is instantiated
rg "UpdateCceCkptTsCc" tx_service/src/cc/local_cc_shards.cpp -B 5 -A 8
Length of output: 1563
🏁 Script executed:
# Check if there are any DataSyncTask creations with ng_term that might be -1 or uninitialized
rg "DataSyncTask\(" tx_service/src/cc/local_cc_shards.cpp -A 8 | grep -B 5 -A 3 "ng_term"
Length of output: 1900
🏁 Script executed:
# Look for the actual field declarations in UpdateCceCkptTsCc to see term_ initialization
rg -A 50 "struct UpdateCceCkptTsCc" tx_service/include/cc/cc_req_misc.h | grep -A 5 "term_\|int64_t"
Length of output: 764
🏁 Script executed:
# Check if there's an assertion or check before UpdateCceCkptTsCc is constructed
rg -B 30 "UpdateCceCkptTsCc update_cce_req" tx_service/src/cc/local_cc_shards.cpp | grep -A 30 "if (cce_entries_map.size"
Length of output: 182
🏁 Script executed:
# Check if there's any place where node_group_term_ might be left as -1 in DataSyncTask before being used for UpdateCceCkptTsCc
rg -B 5 -A 5 "data_sync_task_->node_group_term_" tx_service/src/cc/local_cc_shards.cpp | head -40
Length of output: 554
🏁 Script executed:
# Check if term_ field in UpdateCceCkptTsCc has any member initialization or assertions
rg -A 100 "struct UpdateCceCkptTsCc" tx_service/include/cc/cc_req_misc.h | tail -50
Length of output: 1235
🏁 Script executed:
# Verify if there's any code path where UpdateCceCkptTsCc can be created with term_ < 0
rg "UpdateCceCkptTsCc" tx_service --type cpp -B 15 -A 3 | grep -A 15 "node_group_term_"
Length of output: 240
🏁 Script executed:
# Check if term_ could potentially be -1 by looking at how DataSyncTask is initialized
rg "DataSyncTask(" tx_service/src/cc/local_cc_shards.cpp -A 12 | grep -B 5 -A 7 "ng_term"
Length of output: 139
🏁 Script executed:
# Verify the concern: check if there's any scenario where term_ is not explicitly set or could be -1
rg -A 5 "node_group_term_\{" tx_service/include/data_sync_task.h
Length of output: 185
🏁 Script executed:
# Found the issue! node_group_term_ defaults to -1. Now let's check if DataSyncTask is ever created without explicitly setting ng_term
rg "std::make_shared<DataSyncTask>" tx_service/src/cc/local_cc_shards.cpp -A 15 | head -50
Length of output: 2177
🏁 Script executed:
# Check all DataSyncTask constructor calls to see if ng_term is always provided
rg "make_shared.*DataSyncTask\|std::shared_ptr.*DataSyncTask" tx_service/src/cc/local_cc_shards.cpp -A 5
Length of output: 45
🏁 Script executed:
# Look at the DataSyncTask constructors more carefully
rg -A 30 "DataSyncTask::" tx_service/src/cc/data_sync_task.cpp | head -80
Length of output: 139
🏁 Script executed:
# Let's check what values are passed for ng_term when EnqueueRangeDataSyncTask is called
rg "EnqueueRangeDataSyncTask" tx_service/src/cc/local_cc_shards.cpp -B 5 -A 1 | grep -B 5 "ng_"
Length of output: 556
🏁 Script executed:
# Check if there's any DataSyncTask created without explicit ng_term parameter (would use default -1)
rg "std::make_shared.*DataSyncTask\|DataSyncTask(" tx_service/include/data_sync_task.h -A 15 | head -60
Length of output: 203
🏁 Script executed:
# Review the key creation site again - does node_group_term_ ever stay at default -1?
rg -B 30 "if \(cce_entries_map.size\(\) > 0\)" tx_service/src/cc/local_cc_shards.cpp | tail -40
Length of output: 1316
Align `ValidTermCheck()` with const-variant patterns by adding an assertion.
The const `ValidTermCheck()` lacks a safeguard that similar implementations in cc_request.h use. Add `assert(term_ > 0);` before the term comparison to catch cases where `term_` might not be properly initialized, consistent with how other const `ValidTermCheck()` methods guard against uninitialized stored terms.
🤖 Prompt for AI Agents
In `@tx_service/src/cc/cc_req_misc.cpp` around lines 1171 - 1199, The const method
UpdateCceCkptTsCc::ValidTermCheck() is missing the assert guard used by other
const variants; add assert(term_ > 0); near the start of ValidTermCheck (before
the comparison against current_term) so uninitialized/invalid term_ values are
caught early; ensure the assertion is placed in the body of
UpdateCceCkptTsCc::ValidTermCheck() prior to the if (current_term < 0 ||
current_term != term_) check and that <cassert> is available in the translation
unit.
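Shaped as the prompt describes, the guarded check might look like the following sketch (free-standing names for illustration only; the real method lives on `UpdateCceCkptTsCc` and compares against the sharder's current term):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: a const-style term check that asserts the stored term was
// initialized before comparing it against the node group's current term.
bool ValidTermCheckSketch(int64_t term_, int64_t current_term)
{
    assert(term_ > 0);  // catch uninitialized/invalid stored terms early
    if (current_term < 0 || current_term != term_)
    {
        return false;
    }
    return true;
}
```

In release builds the assert compiles out, so the comparison below it still has to tolerate a bad `term_` gracefully.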
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tx_service/include/cc/cc_request.h (1)
4167-4204: ⚠️ Potential issue | 🟠 Major — `RunOnCandidateNode` flag is currently unused in term checks.
`RangePartitionDataSyncScanCc` stores `run_on_candidate_node_`, but `ValidTermCheck` always includes the candidate term. This undermines the flag's purpose and can allow candidate execution even when not intended.
Possible fix

```diff
-        int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
-        int64_t candidate_ng_term =
-            Sharder::Instance().CandidateLeaderTerm(node_group_id_);
-        int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
-        int64_t current_term =
-            std::max({cc_ng_term, standby_node_term, candidate_ng_term});
+        int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
+        int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
+        int64_t current_term = std::max(cc_ng_term, standby_node_term);
+        if (run_on_candidate_node_)
+        {
+            int64_t candidate_ng_term =
+                Sharder::Instance().CandidateLeaderTerm(node_group_id_);
+            current_term = std::max(current_term, candidate_ng_term);
+        }
```

Also applies to: 4239-4261, 4324-4355, 4560-4563, 4739-4739
🤖 Fix all issues with AI agents
In `@tx_service/include/cc/cc_request.h`:
- Around line 565-585: The ValidTermCheck currently unconditionally accepts
candidate leader terms by taking the max of Sharder::Instance().LeaderTerm and
CandidateLeaderTerm; add an explicit boolean flag (e.g., allow_run_on_candidate_
or a parameter to ValidTermCheck) and change ValidTermCheck (and any callers) to
only consider CandidateLeaderTerm when that flag is true; update callers such as
AcquireAllCc and PostWriteAllCc to pass/set the flag only when they should
permit running on candidate leaders (mirroring AcquireCc's behavior), and ensure
ng_term_ and the comparison logic remain unchanged when the flag is false so
candidate terms are not accepted implicitly.
In `@tx_service/src/cc/local_cc_shards.cpp`:
- Around line 3201-3204: The CheckLeaderTerm call currently passes 'true'
unconditionally which permits candidate-term acceptance; change the call to only
allow candidate term when the task explicitly allows it (e.g., use
!task->sync_on_leader_ or an explicit allow_candidate flag) so leader-only tasks
still fail fast—update the call site
Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
task->node_group_term_, /*allow_candidate=*/!task->sync_on_leader_) and apply
the same conditional change to the other occurrences (the other CheckLeaderTerm
call sites flagged in the review) so leader-only logic uses false for
allow_candidate.
```cpp
bool ValidTermCheck() override
{
    int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
    int64_t candidate_ng_term =
        Sharder::Instance().CandidateLeaderTerm(node_group_id_);
    cc_ng_term = std::max(cc_ng_term, candidate_ng_term);

    if (ng_term_ < 0)
    {
        ng_term_ = cc_ng_term;
    }

    if (cc_ng_term < 0 || cc_ng_term != ng_term_)
    {
        return false;
    }
    else
    {
        return true;
    }
}
```
Gate candidate-term acceptance behind an explicit flag.
`AcquireAllCc` and `PostWriteAllCc` now always accept candidate terms, which bypasses the new `allow_run_on_candidate` intent. If only some flows should run on candidates, add a flag (as in `AcquireCc`) or otherwise document why these paths are unconditional.
Suggested direction

```diff
-        int64_t cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
-        int64_t candidate_ng_term =
-            Sharder::Instance().CandidateLeaderTerm(node_group_id_);
-        cc_ng_term = std::max(cc_ng_term, candidate_ng_term);
+        int64_t cc_ng_term = -1;
+        if (allow_run_on_candidate_)
+        {
+            cc_ng_term = Sharder::Instance().CandidateLeaderTerm(node_group_id_);
+        }
+        if (cc_ng_term < 0)
+        {
+            cc_ng_term = Sharder::Instance().LeaderTerm(node_group_id_);
+        }
```

Also applies to: 996-1016
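The direction above — consult the candidate term only when the request explicitly opted in, otherwise fall back to the leader term — can be sketched as a standalone helper (names are illustrative, not the codebase's API; `-1` is used as the invalid-term sentinel, matching the review's `INIT_TERM`):

```cpp
#include <cstdint>

// Illustrative sketch: resolve the node-group term a cc request should
// validate against. The candidate term is considered only when the request
// explicitly opted in via allow_run_on_candidate; -1 means "no valid term".
int64_t ResolveNgTerm(int64_t leader_term,
                      int64_t candidate_term,
                      bool allow_run_on_candidate)
{
    int64_t ng_term = -1;
    if (allow_run_on_candidate)
    {
        // Opted-in requests may run on a candidate leader during recovery.
        ng_term = candidate_term;
    }
    if (ng_term < 0)
    {
        // Fall back to the current leader term.
        ng_term = leader_term;
    }
    return ng_term;
}
```

With this shape, a request that did not opt in never observes the candidate term, even when one exists — the property the review says the unconditional `std::max` version loses.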
🤖 Prompt for AI Agents
In `@tx_service/include/cc/cc_request.h` around lines 565 - 585, The
ValidTermCheck currently unconditionally accepts candidate leader terms by
taking the max of Sharder::Instance().LeaderTerm and CandidateLeaderTerm; add an
explicit boolean flag (e.g., allow_run_on_candidate_ or a parameter to
ValidTermCheck) and change ValidTermCheck (and any callers) to only consider
CandidateLeaderTerm when that flag is true; update callers such as AcquireAllCc
and PostWriteAllCc to pass/set the flag only when they should permit running on
candidate leaders (mirroring AcquireCc's behavior), and ensure ng_term_ and the
comparison logic remain unchanged when the flag is false so candidate terms are
not accepted implicitly.
```diff
             if (!Sharder::Instance().CheckLeaderTerm(
-                    task->node_group_id_, task->node_group_term_))
+                    task->node_group_id_,
+                    task->node_group_term_,
+                    true))
```
Don’t always allow candidate term in post‑processing checks.
Passing `true` unconditionally to `CheckLeaderTerm` can let leader-only tasks complete after leadership loss (depending on term semantics). Consider gating allow-candidate by task intent (e.g., `!task->sync_on_leader_` or an explicit allow-candidate flag) so leader-only tasks still fail fast.
💡 Suggested fix pattern

```diff
-            if (!Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
-                                                     task->node_group_term_,
-                                                     true))
+            if (!Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
+                                                     task->node_group_term_,
+                                                     !task->sync_on_leader_))
```

Also applies to: 3291-3292, 3394-3395, 3468-3469, 4489-4490
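The intent-gated check suggested above can be expressed as a small predicate (a sketch with hypothetical free-standing names; the real `CheckLeaderTerm` is a `Sharder` method that looks terms up itself):

```cpp
#include <cstdint>

// Sketch: a task's term is accepted if it matches the leader term, or —
// only for tasks not pinned to the leader — the candidate term.
bool CheckTermForTask(int64_t task_term,
                      int64_t leader_term,
                      int64_t candidate_term,
                      bool sync_on_leader)
{
    if (task_term >= 0 && task_term == leader_term)
    {
        return true;
    }
    // Mirrors the suggested !task->sync_on_leader_ gating: leader-only
    // tasks never fall through to the candidate comparison.
    bool allow_candidate = !sync_on_leader;
    return allow_candidate && task_term >= 0 && task_term == candidate_term;
}
```

Under this gating, a leader-only task whose term now matches only the candidate term fails fast, while a candidate-tolerant task is allowed to finish.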
🤖 Prompt for AI Agents
In `@tx_service/src/cc/local_cc_shards.cpp` around lines 3201 - 3204, The
CheckLeaderTerm call currently passes 'true' unconditionally which permits
candidate-term acceptance; change the call to only allow candidate term when the
task explicitly allows it (e.g., use !task->sync_on_leader_ or an explicit
allow_candidate flag) so leader-only tasks still fail fast—update the call site
Sharder::Instance().CheckLeaderTerm(task->node_group_id_,
task->node_group_term_, /*allow_candidate=*/!task->sync_on_leader_) and apply
the same conditional change to the other occurrences (the other CheckLeaderTerm
call sites flagged in the review) so leader-only logic uses false for
allow_candidate.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tx_service/src/cc/local_cc_shards.cpp`:
- Around line 3852-3855: The condition incorrectly excludes term 0; update the
check on Sharder::Instance().CandidateLeaderTerm(ng_id) to use >= 0 (not > 0) so
term 0 is treated as a valid candidate leader term; modify the if condition
around CandidateLeaderTerm(ng_id) and ensure
data_sync_task->SetRunOnLeaderNode(false) still executes for any non-negative
term.
🧹 Nitpick comments (1)
tx_service/src/cc/local_cc_shards.cpp (1)
3254-3256: `UpdateLastDataSyncTS` is monotonic and safe with 0 values; clarify calls to skip on candidate runs.
Both `UpdateLastDataSyncTS` implementations guard with `if (last_sync_ts > last_sync_ts_)`, preventing any regression. Passing `0` on candidate runs is functionally safe and will be ignored by the monotonic check. However, to improve clarity and avoid unnecessary function calls, skip the call entirely unless `run_on_leader_node_` is true. This aligns with the PR intent and existing patterns (e.g., local_cc_shards.h:986-989).
Apply to: 3254-3256, 3411-3413, 3930-3933, 4514-4516, 5737-5739.
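The monotonic guard described above is what makes a `0` argument a harmless no-op; a minimal sketch (struct and field names are illustrative, not the actual `LocalCcShards` layout):

```cpp
#include <cstdint>

// Sketch of a monotonic "last data sync timestamp" holder: updates only
// move the stored value forward, so a 0 passed on candidate runs — or any
// stale timestamp — is silently ignored.
struct LastSyncTsSketch
{
    uint64_t last_sync_ts_{0};

    void UpdateLastDataSyncTS(uint64_t last_sync_ts)
    {
        if (last_sync_ts > last_sync_ts_)
        {
            last_sync_ts_ = last_sync_ts;
        }
    }
};
```

Because the guard already rejects non-advancing values, skipping the call on candidate runs is purely a readability/cost optimization, not a correctness fix.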
```cpp
if (Sharder::Instance().CandidateLeaderTerm(ng_id) > 0)
{
    data_sync_task->SetRunOnLeaderNode(false);
}
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Search for CandidateLeaderTerm usage to understand term semantics
rg "CandidateLeaderTerm" -C 3 -n
```

🏁 Script executed:

```shell
#!/bin/bash
# Look for term-related constants (INIT_TERM, etc.)
rg "INIT_TERM|MIN_TERM|INVALID_TERM" -n
```

🏁 Script executed:

```shell
#!/bin/bash
# Search for other candidate term comparisons to see common patterns
rg "CandidateLeaderTerm.*[<>=]" -n
```

🏁 Script executed:

```shell
#!/bin/bash
# Check the context around line 3852 to understand the logic better
sed -n '3840,3870p' tx_service/src/cc/local_cc_shards.cpp
```
Change `CandidateLeaderTerm(ng_id) > 0` to `>= 0` to include term 0. Term 0 is a valid candidate leader term (`INIT_TERM = -1` is the invalid sentinel). Other validity checks in the codebase use `>= 0` for this purpose (e.g., the assert at line 1595, and line 275 in tx_execution.h), making `> 0` inconsistent and causing it to incorrectly skip the first candidate leader term.
🤖 Prompt for AI Agents
In `@tx_service/src/cc/local_cc_shards.cpp` around lines 3852 - 3855, The
condition incorrectly excludes term 0; update the check on
Sharder::Instance().CandidateLeaderTerm(ng_id) to use >= 0 (not > 0) so term 0
is treated as a valid candidate leader term; modify the if condition around
CandidateLeaderTerm(ng_id) and ensure data_sync_task->SetRunOnLeaderNode(false)
still executes for any non-negative term.
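The off-by-one boundary flagged above is easy to pin down with a tiny validity predicate (a sketch assuming, per the review, that the invalid sentinel is -1 and 0 is the first valid term):

```cpp
#include <cstdint>

constexpr int64_t kInitTerm = -1;  // assumed invalid sentinel, per the review

// A term is valid iff it is non-negative; a `> 0` comparison would
// wrongly reject term 0, the first valid candidate leader term.
bool IsValidCandidateTerm(int64_t term)
{
    return term >= 0;
}
```

The single test case that distinguishes the two comparisons is `term == 0`: `>= 0` accepts it, `> 0` does not.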
To prevent OUT_OF_MEMORY from causing log replay to fail to complete during cc node recovery, `DataSync` operations should be allowed under candidate leader conditions. Additionally:

- Term checks for range/catalog data sync are moved into the `Execute` method of the cc request, using the `ValidTermCheck` method.
- An `allow_run_on_candidate_` flag is added to the cc request to determine whether to check the candidate term when performing term checks.

Summary by CodeRabbit